This workbook was created using the “dataexpks” template:
https://github.com/DublinLearningGroup/dataexpks
This workbook performs the basic data exploration of the dataset.
First we load the dataset.
### _TEMPLATE_
### Data is loaded into dataset rawdata_tbl here
### We may wish to set column types
data_col_types <- cols(
prem_freq = col_character()
# VAR1 = col_character()
# ,VAR2 = col_date()
# ,VAR3 = col_number()
)
### Data is loaded into dataset rawdata_tbl here
rawdata_tbl <- read_csv(
'data/lifeinsurance_data.csv',
locale = locale(),
col_types = data_col_types,
progress = FALSE
)
glimpse(rawdata_tbl)
## Rows: 50,000
## Columns: 27
## $ policy_id <chr> "C004423472", "C008639942", "C006196099", "C0…
## $ countyname <chr> "South Dublin", "Dún Laoghaire-Rathdown", "Du…
## $ edname <chr> "Ballyboden", "Glencullen", "Rathfarnham", "C…
## $ nuts3name <chr> "Dublin", "Dublin", "Dublin", "Mid-West", "Du…
## $ sa_id <chr> "A267013001", "A267092033", "A268127004", "A0…
## $ cluster_id <chr> "n6_c4", "n6_c5", "n6_c4", "n6_c2", "n6_c4", …
## $ prod_type <chr> "protection", "protection", "pension", "prote…
## $ prem_type <chr> "RP", "RP", "RP", "RP", "SP", "RP", "SP", "RP…
## $ prem_freq <chr> "12", "12", "12", "4", NA, "12", NA, "12", "1…
## $ prem_ape <dbl> 1628.33, 174.06, 600.00, 1294.70, 600.00, 679…
## $ prem_risk <dbl> 1163.0933, 124.3313, NA, 924.7867, NA, 485.36…
## $ policy_startdate <date> 2001-01-24, 1990-01-15, 2006-01-22, 1996-08-…
## $ policy_enddate <date> 2011-01-24, 2005-01-15, 2107-05-12, 2016-08-…
## $ policy_duration <dbl> 10, 15, NA, 20, NA, 20, NA, NA, 18, NA, 15, 5…
## $ mort_rating <dbl> 100, 100, NA, 200, NA, 100, NA, NA, 100, NA, …
## $ sum_assured <dbl> 500000, 250000, NA, 100000, NA, 400000, NA, N…
## $ dob_life1 <date> 1968-05-12, 1973-12-23, 1987-05-12, 1960-12-…
## $ gender_life1 <chr> "M", "F", "M", "M", "F", "M", "F", "F", "M", …
## $ smoker_life1 <chr> "N", "N", "S", "S", "Q", "N", "N", "S", "N", …
## $ isjointlife <lgl> FALSE, TRUE, NA, FALSE, NA, FALSE, NA, NA, TR…
## $ islifeonly <lgl> FALSE, TRUE, NA, FALSE, NA, TRUE, NA, NA, TRU…
## $ mortgage_status <chr> "TERM", "MORTDECR", NA, "TERM", NA, "MORTDECR…
## $ lapse_month <dbl> 30, 111, 97, 8, 89, 111, 103, 100, 25, 103, 1…
## $ policy_lapsedate <date> 2003-07-24, 1999-04-15, 2014-02-22, 1997-04-…
## $ policy_status <chr> "lapsed", "lapsed", "lapsed", "lapsed", "info…
## $ policy_statuschangedate <date> 2003-07-24, 1999-04-15, 2014-02-22, 1997-04-…
## $ lapsed <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TR…
### _TEMPLATE_
### Do simple datatype transforms and save output in data_tbl
cleaned_names <- rawdata_tbl %>% names() %>% to_snake_case()
data_tbl <- rawdata_tbl %>% set_colnames(cleaned_names)
glimpse(data_tbl)
## Rows: 50,000
## Columns: 27
## $ policy_id <chr> "C004423472", "C008639942", "C006196099", "C0…
## $ countyname <chr> "South Dublin", "Dún Laoghaire-Rathdown", "Du…
## $ edname <chr> "Ballyboden", "Glencullen", "Rathfarnham", "C…
## $ nuts_3_name <chr> "Dublin", "Dublin", "Dublin", "Mid-West", "Du…
## $ sa_id <chr> "A267013001", "A267092033", "A268127004", "A0…
## $ cluster_id <chr> "n6_c4", "n6_c5", "n6_c4", "n6_c2", "n6_c4", …
## $ prod_type <chr> "protection", "protection", "pension", "prote…
## $ prem_type <chr> "RP", "RP", "RP", "RP", "SP", "RP", "SP", "RP…
## $ prem_freq <chr> "12", "12", "12", "4", NA, "12", NA, "12", "1…
## $ prem_ape <dbl> 1628.33, 174.06, 600.00, 1294.70, 600.00, 679…
## $ prem_risk <dbl> 1163.0933, 124.3313, NA, 924.7867, NA, 485.36…
## $ policy_startdate <date> 2001-01-24, 1990-01-15, 2006-01-22, 1996-08-…
## $ policy_enddate <date> 2011-01-24, 2005-01-15, 2107-05-12, 2016-08-…
## $ policy_duration <dbl> 10, 15, NA, 20, NA, 20, NA, NA, 18, NA, 15, 5…
## $ mort_rating <dbl> 100, 100, NA, 200, NA, 100, NA, NA, 100, NA, …
## $ sum_assured <dbl> 500000, 250000, NA, 100000, NA, 400000, NA, N…
## $ dob_life_1 <date> 1968-05-12, 1973-12-23, 1987-05-12, 1960-12-…
## $ gender_life_1 <chr> "M", "F", "M", "M", "F", "M", "F", "F", "M", …
## $ smoker_life_1 <chr> "N", "N", "S", "S", "Q", "N", "N", "S", "N", …
## $ isjointlife <lgl> FALSE, TRUE, NA, FALSE, NA, FALSE, NA, NA, TR…
## $ islifeonly <lgl> FALSE, TRUE, NA, FALSE, NA, TRUE, NA, NA, TRU…
## $ mortgage_status <chr> "TERM", "MORTDECR", NA, "TERM", NA, "MORTDECR…
## $ lapse_month <dbl> 30, 111, 97, 8, 89, 111, 103, 100, 25, 103, 1…
## $ policy_lapsedate <date> 2003-07-24, 1999-04-15, 2014-02-22, 1997-04-…
## $ policy_status <chr> "lapsed", "lapsed", "lapsed", "lapsed", "info…
## $ policy_statuschangedate <date> 2003-07-24, 1999-04-15, 2014-02-22, 1997-04-…
## $ lapsed <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TR…
We now create derived features useful for modelling. These values are new variables calculated from existing variables in the data.
## Rows: 50,000
## Columns: 27
## $ policy_id <chr> "C004423472", "C008639942", "C006196099", "C0…
## $ countyname <chr> "South Dublin", "Dún Laoghaire-Rathdown", "Du…
## $ edname <chr> "Ballyboden", "Glencullen", "Rathfarnham", "C…
## $ nuts_3_name <chr> "Dublin", "Dublin", "Dublin", "Mid-West", "Du…
## $ sa_id <chr> "A267013001", "A267092033", "A268127004", "A0…
## $ cluster_id <chr> "n6_c4", "n6_c5", "n6_c4", "n6_c2", "n6_c4", …
## $ prod_type <chr> "protection", "protection", "pension", "prote…
## $ prem_type <chr> "RP", "RP", "RP", "RP", "SP", "RP", "SP", "RP…
## $ prem_freq <chr> "12", "12", "12", "4", NA, "12", NA, "12", "1…
## $ prem_ape <dbl> 1628.33, 174.06, 600.00, 1294.70, 600.00, 679…
## $ prem_risk <dbl> 1163.0933, 124.3313, NA, 924.7867, NA, 485.36…
## $ policy_startdate <date> 2001-01-24, 1990-01-15, 2006-01-22, 1996-08-…
## $ policy_enddate <date> 2011-01-24, 2005-01-15, 2107-05-12, 2016-08-…
## $ policy_duration <dbl> 10, 15, NA, 20, NA, 20, NA, NA, 18, NA, 15, 5…
## $ mort_rating <dbl> 100, 100, NA, 200, NA, 100, NA, NA, 100, NA, …
## $ sum_assured <dbl> 500000, 250000, NA, 100000, NA, 400000, NA, N…
## $ dob_life_1 <date> 1968-05-12, 1973-12-23, 1987-05-12, 1960-12-…
## $ gender_life_1 <chr> "M", "F", "M", "M", "F", "M", "F", "F", "M", …
## $ smoker_life_1 <chr> "N", "N", "S", "S", "Q", "N", "N", "S", "N", …
## $ isjointlife <lgl> FALSE, TRUE, NA, FALSE, NA, FALSE, NA, NA, TR…
## $ islifeonly <lgl> FALSE, TRUE, NA, FALSE, NA, TRUE, NA, NA, TRU…
## $ mortgage_status <chr> "TERM", "MORTDECR", NA, "TERM", NA, "MORTDECR…
## $ lapse_month <dbl> 30, 111, 97, 8, 89, 111, 103, 100, 25, 103, 1…
## $ policy_lapsedate <date> 2003-07-24, 1999-04-15, 2014-02-22, 1997-04-…
## $ policy_status <chr> "lapsed", "lapsed", "lapsed", "lapsed", "info…
## $ policy_statuschangedate <date> 2003-07-24, 1999-04-15, 2014-02-22, 1997-04-…
## $ lapsed <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TR…
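As an illustration of what a derived feature might look like, the sketch below computes the age of the first life at policy start from `dob_life_1` and `policy_startdate`. The feature name `age_at_start`, the toy values, and the 365.25-day year are assumptions made for this example; the workbook output above does not show any such column being added.

```r
library(dplyr)

# Toy data using two column names from the dataset; values are illustrative
toy_tbl <- tibble(
  dob_life_1       = as.Date(c("1968-05-12", "1973-12-23")),
  policy_startdate = as.Date(c("2001-01-24", "1990-01-15"))
)

# Hypothetical derived feature: age in years at policy start,
# approximating a year as 365.25 days
derived_tbl <- toy_tbl %>%
  mutate(
    age_at_start = as.numeric(
      difftime(policy_startdate, dob_life_1, units = "days")
    ) / 365.25
  )

print(derived_tbl)
```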
Before we do anything with the data, we first check for missing values in the dataset. In some cases, missing data is coded by a special character rather than as a blank, so we first correct for this.
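A minimal sketch of such a recoding step is shown below. The sentinel values `"?"` and `""` and the toy table are assumptions for illustration only; the codes actually used in a given dataset need to be checked against its documentation.

```r
library(dplyr)

# Toy data in which "?" and "" stand in for missing values (assumed codes)
toy_tbl <- tibble(
  id     = c("a", "b", "c"),
  status = c("ok", "?", ""),
  amount = c(1.5, NA, 3)
)

missing_codes <- c("?", "")

# Replace the sentinel strings with NA in every character column
recoded_tbl <- toy_tbl %>%
  mutate(across(
    where(is.character),
    ~ if_else(.x %in% missing_codes, NA_character_, .x)
  ))

print(recoded_tbl)
```

Only character columns are touched here, so numeric columns keep any `NA` values they already had.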
With missing data properly encoded, we now visualise the missing data in a number of different ways.
We first examine a simple univariate count of all the missing data:
row_count <- data_tbl %>% nrow()
missing_univariate_tbl <- data_tbl %>%
summarise_all(list(~sum(are_na(.)))) %>%
gather("variable", "missing_count") %>%
mutate(missing_prop = missing_count / row_count)
ggplot(missing_univariate_tbl) +
geom_bar(aes(x = fct_reorder(variable, -missing_prop),
weight = missing_prop)) +
xlab("Variable") +
ylab("Missing Value Proportion") +
theme(axis.text.x = element_text(angle = 90))
We remove all variables where all of the entries are missing.
remove_vars <- missing_univariate_tbl %>%
filter(missing_count == row_count) %>%
pull(variable)
lessmiss_data_tbl <- data_tbl %>%
select(-one_of(remove_vars))
With these columns removed, we repeat the exercise.
missing_univariate_tbl <- lessmiss_data_tbl %>%
summarise_all(list(~sum(are_na(.)))) %>%
gather("variable", "missing_count") %>%
mutate(missing_prop = missing_count / row_count)
ggplot(missing_univariate_tbl) +
geom_bar(aes(x = fct_reorder(variable, -missing_prop),
weight = missing_prop)) +
xlab("Variable") +
ylab("Missing Value Proportion") +
theme(axis.text.x = element_text(angle = 90))
To reduce the scale of this plot, we look at the fifty variables with the highest missing data counts.
missing_univariate_top_tbl <- missing_univariate_tbl %>%
arrange(desc(missing_count)) %>%
top_n(n = 50, wt = missing_count)
ggplot(missing_univariate_top_tbl) +
geom_bar(aes(x = fct_reorder(variable, -missing_prop),
weight = missing_prop)) +
xlab("Variable") +
ylab("Missing Value Proportion") +
theme(axis.text.x = element_text(angle = 90))
It is also useful to see which variables tend to be missing at the same time. To visualise this, we count how often each combination of variables is missing together and plot those counts as a heat map.
row_count <- rawdata_tbl %>% nrow()
missing_plot_tbl <- rawdata_tbl %>%
mutate_all(function(x) x %>% are_na() %>% vec_cast(integer())) %>%
mutate(label = pmap_chr(., str_c)) %>%
group_by(label) %>%
summarise_all(list(sum)) %>%
arrange(desc(label)) %>%
select(-label) %>%
mutate(label_count = pmap_int(., pmax)) %>%
gather("col", "count", -label_count) %>%
mutate(miss_prop = count / row_count,
group_label = sprintf("%6.4f", round(label_count / row_count, 4))
)
ggplot(missing_plot_tbl) +
geom_tile(aes(x = col, y = group_label, fill = miss_prop), height = 0.8) +
scale_fill_continuous() +
scale_x_discrete(position = "top") +
xlab("Variable") +
ylab("Missing Value Proportion") +
theme(axis.text.x = element_text(angle = 90))
This visualisation takes a little explaining.
Each row represents a combination of variables with simultaneous missing values. For each row in the graphic, the coloured entries show which particular variables are missing in that combination. The proportion of rows with that combination is displayed in both the label for the row and the colouring for the cells in the row.
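To make the idea of a missing-value combination concrete, the sketch below counts the patterns in a small toy table. It is a simplified stand-in for the heat-map data preparation, not a copy of it; the column names `x1`, `x2`, `x3` and the values are invented for the example.

```r
library(dplyr)

# Toy data: x1 and x2 are missing together in two rows, x3 alone in one row
toy_tbl <- tibble(
  x1 = c(1, NA, NA, 4),
  x2 = c("p", NA, NA, "q"),
  x3 = c(NA, 1, 2, 3)
)

# Turn every column into a 0/1 missing indicator, then count each
# distinct combination of indicators
pattern_tbl <- toy_tbl %>%
  mutate(across(everything(), ~ as.integer(is.na(.x)))) %>%
  count(x1, x2, x3, name = "row_count") %>%
  mutate(row_prop = row_count / sum(row_count))

print(pattern_tbl)
```

Each row of `pattern_tbl` corresponds to one row of the heat map: the indicator columns say which variables are missing in that combination, and `row_prop` is the share of records showing it.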
With the raw data loaded up we now remove obvious unique or near-unique variables that are not amenable to basic exploration and plotting.
coltype_lst <- create_coltype_list(data_tbl)
catvar_valuecount_tbl <- data_tbl %>%
summarise_at(coltype_lst$split$discrete, ~ .x %>% unique() %>% length()) %>%
gather("var_name", "level_count") %>%
arrange(-level_count)
print(catvar_valuecount_tbl)
## # A tibble: 13 x 2
## var_name level_count
## <chr> <int>
## 1 policy_id 50000
## 2 sa_id 14450
## 3 edname 2792
## 4 countyname 34
## 5 nuts_3_name 8
## 6 cluster_id 6
## 7 prod_type 4
## 8 prem_freq 4
## 9 mortgage_status 4
## 10 smoker_life_1 3
## 11 policy_status 3
## 12 prem_type 2
## 13 gender_life_1 2
## Dataset has 50000 rows
Now that we have a table of the level counts for all the categorical variables, we can automatically exclude unique variables from the exploration, as their level count will match the row count.
unique_vars <- catvar_valuecount_tbl %>%
filter(level_count == row_count) %>%
pull(var_name)
print(unique_vars)
## [1] "policy_id"
Having removed the unique identifier variables from the dataset, we may also wish to exclude categoricals with high level counts, so we create a vector of those variable names.
highcount_vars <- catvar_valuecount_tbl %>%
filter(level_count >= dataexp_level_excl_thresh,
level_count < row_count) %>%
pull(var_name)
cat(str_c(highcount_vars, collapse = ", "))
## sa_id, edname
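The two filters above can be tried on a small example. In the sketch below the toy table and the threshold value of 3 are assumptions; the value of `dataexp_level_excl_thresh` used in this workbook is set elsewhere and is not shown here.

```r
library(dplyr)
library(tidyr)

# Toy data: id is unique per row, area has many levels, region has few
toy_tbl <- tibble(
  id     = c("p1", "p2", "p3", "p4"),
  region = c("east", "east", "west", "west"),
  area   = c("a1", "a2", "a3", "a3")
)

toy_row_count     <- nrow(toy_tbl)
level_excl_thresh <- 3   # assumed threshold, for illustration only

# Count the distinct values in each column
level_count_tbl <- toy_tbl %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), names_to = "var_name", values_to = "level_count")

# Variables with one level per row behave like identifiers
toy_unique_vars <- level_count_tbl %>%
  filter(level_count == toy_row_count) %>%
  pull(var_name)

# Variables at or above the threshold, but not unique, are high-count
toy_highcount_vars <- level_count_tbl %>%
  filter(level_count >= level_excl_thresh, level_count < toy_row_count) %>%
  pull(var_name)

print(toy_unique_vars)
print(toy_highcount_vars)
```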
We now can continue doing some basic exploration of the data. We may also choose to remove some extra columns from the dataset.
### You may want to comment out these next few lines to customise which
### categoricals are kept in the exploration.
drop_vars <- c(highcount_vars)
if (length(drop_vars) > 0) {
explore_data_tbl <- explore_data_tbl %>%
select(-one_of(drop_vars))
cat(str_c(drop_vars, collapse = ", "))
}
## sa_id, edname
Now that we have loaded the data we can prepare it for some basic data exploration. We first exclude the variables that are unique identifiers or similar, and then split the remaining variables into categories to help with the systematic data exploration.
## $split
## $split$continuous
## [1] "prem_ape" "prem_risk" "policy_duration" "mort_rating"
## [5] "sum_assured" "lapse_month"
##
## $split$datetime
## [1] "policy_startdate" "policy_enddate"
## [3] "dob_life_1" "policy_lapsedate"
## [5] "policy_statuschangedate"
##
## $split$discrete
## [1] "countyname" "nuts_3_name" "cluster_id" "prod_type"
## [5] "prem_type" "prem_freq" "gender_life_1" "smoker_life_1"
## [9] "mortgage_status" "policy_status"
##
## $split$logical
## [1] "isjointlife" "islifeonly" "lapsed"
##
##
## $columns
## countyname nuts_3_name cluster_id
## "discrete" "discrete" "discrete"
## prod_type prem_type prem_freq
## "discrete" "discrete" "discrete"
## prem_ape prem_risk policy_startdate
## "continuous" "continuous" "datetime"
## policy_enddate policy_duration mort_rating
## "datetime" "continuous" "continuous"
## sum_assured dob_life_1 gender_life_1
## "continuous" "datetime" "discrete"
## smoker_life_1 isjointlife islifeonly
## "discrete" "logical" "logical"
## mortgage_status lapse_month policy_lapsedate
## "discrete" "continuous" "datetime"
## policy_status policy_statuschangedate lapsed
## "discrete" "datetime" "logical"
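The list above comes from the template helper `create_coltype_list()`, whose source is not reproduced in this workbook. The sketch below shows one way a similar split could be built in base R; the function `classify_column()` and its rules are assumptions for illustration and may differ from what the helper actually does.

```r
# Hypothetical classifier: maps a column to one of four exploration categories
classify_column <- function(x) {
  if (is.logical(x)) {
    "logical"
  } else if (inherits(x, c("Date", "POSIXct"))) {
    "datetime"
  } else if (is.numeric(x)) {
    "continuous"
  } else {
    "discrete"
  }
}

# Toy data with one column of each kind, reusing names from the dataset
toy_df <- data.frame(
  prem_ape         = c(1628.33, 174.06),
  policy_startdate = as.Date(c("2001-01-24", "1990-01-15")),
  prod_type        = c("protection", "pension"),
  lapsed           = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Named vector of categories, and the same information split by category
toy_columns <- vapply(toy_df, classify_column, character(1))
toy_split   <- split(names(toy_columns), toy_columns)

print(toy_split)
```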
Logical variables only take two values: TRUE or FALSE. It is useful to see missing data as well though, so we also plot the count of those.
logical_vars <- coltype_lst$split$logical %>% sort()
for (plot_varname in logical_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
na_count <- explore_data_tbl %>% pull(!! plot_varname) %>% are_na() %>% sum()
explore_plot <- ggplot(explore_data_tbl) +
geom_bar(aes(x = !! sym(plot_varname))) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c("Barplot of Counts for Variable: ", plot_varname,
" (", na_count, " missing values)")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
plot(explore_plot)
}
## --
## isjointlife
## --
## islifeonly
## --
## lapsed
Numeric variables are usually continuous in nature, though we also have integer data.
numeric_vars <- coltype_lst$split$continuous %>% sort()
for (plot_varname in numeric_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_var <- explore_data_tbl %>% pull(!! plot_varname)
na_count <- plot_var %>% are_na() %>% sum()
plot_var %>% summary %>% print
explore_plot <- ggplot(explore_data_tbl) +
geom_histogram(aes(x = !! sym(plot_varname)),
bins = hist_bins_count) +
geom_vline(xintercept = mean(plot_var, na.rm = TRUE),
colour = "red", size = 1.5) +
geom_vline(xintercept = median(plot_var, na.rm = TRUE),
colour = "green", size = 1.5) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c("Histogram Plot for Variable: ", plot_varname,
" (", na_count, " missing values)"),
subtitle = "(red line is mean, green line is median)")
explore_std_plot <- explore_plot + scale_x_continuous(labels = label_comma())
explore_log_plot <- explore_plot + scale_x_log10 (labels = label_comma())
plot_grid(explore_std_plot,
explore_log_plot, nrow = 2) %>% print()
}
## --
## lapse_month
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 71.0 95.0 106.9 109.0 420.0
## --
## mort_rating
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 100.0 100.0 150.0 141.8 200.0 325.0 24361
## --
## policy_duration
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.00 10.00 20.00 17.27 20.00 35.00 15850
## --
## prem_ape
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 51.8 906.6 2038.4 5033.6 4502.8 470632.5
## --
## prem_risk
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 37.0 611.8 1331.5 3545.6 3160.3 336166.1 24361
## --
## sum_assured
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 100000 200000 300000 443268 450000 5000000 24361
Categorical variables take values from a limited, and usually fixed, set of possible levels.
categorical_vars <- coltype_lst$split$discrete %>% sort()
for (plot_varname in categorical_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
na_count <- explore_data_tbl %>% pull(!! plot_varname) %>% are_na() %>% sum()
plot_tbl <- explore_data_tbl %>%
pull(!! plot_varname) %>%
fct_lump(n = cat_level_count) %>%
fct_count() %>%
mutate(f = fct_relabel(f, str_trunc, width = 15))
explore_plot <- ggplot(plot_tbl) +
geom_bar(aes(x = fct_reorder(f, -n), weight = n)) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c("Barplot of Counts for Variable: ", plot_varname,
" (", na_count, " missing values)")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
plot(explore_plot)
}
## --
## cluster_id
## --
## countyname
## --
## gender_life_1
## --
## mortgage_status
## --
## nuts_3_name
## --
## policy_status
## --
## prem_freq
## --
## prem_type
## --
## prod_type
## --
## smoker_life_1
Date/Time variables represent calendar or time-based data such as a time of day, a date, or a timestamp.
datetime_vars <- coltype_lst$split$datetime %>% sort()
for (plot_varname in datetime_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_var <- explore_data_tbl %>% pull(!! plot_varname)
na_count <- plot_var %>% are_na() %>% sum()
plot_var %>% summary() %>% print()
explore_plot <- ggplot(explore_data_tbl) +
geom_histogram(aes(x = !! sym(plot_varname)),
bins = hist_bins_count) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c("Barplot of Dates/Times in Variable: ", plot_varname,
" (", na_count, " missing values)"))
plot(explore_plot)
}
## --
## dob_life_1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1899-07-05" "1952-05-21" "1961-09-17" "1961-01-26" "1970-08-14" "1999-08-27"
## --
## policy_enddate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1995-01-05" "2015-08-15" "2026-05-30" "2038-07-21" "2066-02-07" "2117-04-13"
## --
## policy_lapsedate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1990-02-17" "2004-02-16" "2011-01-27" "2011-11-10" "2018-04-13" "2050-12-20"
## --
## policy_startdate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1990-01-01" "1996-05-31" "2002-11-14" "2002-12-12" "2009-06-25" "2015-12-31"
## --
## policy_statuschangedate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1990-02-17" "2003-04-12" "2008-11-11" "2007-09-01" "2012-07-11" "2015-12-31"
We now move on to looking at bivariate plots of the data set.
A natural way to explore relationships in data is to create univariate visualisations faceted by a categorical variable.
For logical variables we facet on barplots of the levels, comparing TRUE, FALSE and missing data.
logical_vars <- logical_vars[!logical_vars %in% facet_varname] %>% sort()
for (plot_varname in logical_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_tbl <- data_tbl %>% filter(!are_na(!! plot_varname))
explore_plot <- ggplot(plot_tbl) +
geom_bar(aes(x = !! sym(plot_varname))) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c(facet_varname, "-Faceted Barplots for Variable: ",
plot_varname)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
plot(explore_plot)
}
## --
## isjointlife
## --
## islifeonly
## --
## lapsed
For numeric variables, we facet on histograms of the data.
for (plot_varname in numeric_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_tbl <- data_tbl %>% filter(!are_na(!! sym(plot_varname)))
explore_plot <- ggplot(plot_tbl) +
geom_histogram(aes(x = !! sym(plot_varname)),
bins = hist_bins_count) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_x_continuous(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
plot_varname)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
print(explore_plot)
}
## --
## lapse_month
## --
## mort_rating
## --
## policy_duration
## --
## prem_ape
## --
## prem_risk
## --
## sum_assured
We treat categorical variables like logical variables, faceting the barplots of the different levels of the data.
categorical_vars <- categorical_vars[!categorical_vars %in% facet_varname] %>% sort()
for (plot_varname in categorical_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_tbl <- data_tbl %>%
filter(!are_na(!! sym(plot_varname))) %>%
mutate(
varname_trunc = fct_relabel(!! sym(plot_varname), str_trunc, width = 10)
)
explore_plot <- ggplot(plot_tbl) +
geom_bar(aes(x = varname_trunc)) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
plot_varname)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
plot(explore_plot)
}
## --
## cluster_id
## --
## countyname
## --
## gender_life_1
## --
## mortgage_status
## --
## nuts_3_name
## --
## policy_status
## --
## prem_freq
## --
## prem_type
## --
## smoker_life_1
Like the univariate plots, we facet on histograms of the years in the dates.
for (plot_varname in datetime_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_tbl <- data_tbl %>% filter(!are_na(!! sym(plot_varname)))
explore_plot <- ggplot(plot_tbl) +
geom_histogram(aes(x = !! sym(plot_varname)),
bins = hist_bins_count) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
plot_varname))
plot(explore_plot)
}
## --
## dob_life_1
## --
## policy_enddate
## --
## policy_lapsedate
## --
## policy_startdate
## --
## policy_statuschangedate
In this section you can add your own multivariate visualisations such as boxplots and so on.
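As a starting point, the sketch below draws a boxplot of one numeric variable split by one categorical variable. It uses a small toy table with the column names `prod_type` and `prem_ape` from the dataset; the values and the choice of a log scale are assumptions for the example.

```r
library(ggplot2)

# Toy data standing in for the exploration dataset; values are illustrative
toy_tbl <- data.frame(
  prod_type = rep(c("protection", "pension"), each = 4),
  prem_ape  = c(150, 600, 1300, 1600, 600, 900, 2400, 5000)
)

# Boxplot of the premium by product type, on a log scale because
# premium values tend to be heavily right-skewed
boxplot_plot <- ggplot(toy_tbl) +
  geom_boxplot(aes(x = prod_type, y = prem_ape)) +
  scale_y_log10() +
  xlab("prod_type") +
  ylab("prem_ape (log scale)") +
  ggtitle("Boxplot of prem_ape by prod_type")

print(boxplot_plot)
```

Swapping `toy_tbl` for the exploration dataset and changing the two column names gives the same plot for any numeric and categorical pair.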
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.1 (2020-06-06)
## os Ubuntu 20.04.1 LTS
## system x86_64, linux-gnu
## ui RStudio
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2020-10-05
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib
## arrayhelpers 1.1-0 2020-02-04 [1]
## assertthat 0.2.1 2019-03-21 [1]
## backports 1.1.8 2020-06-17 [1]
## bayesplot * 1.7.2 2020-05-28 [1]
## blob 1.2.1 2020-01-20 [1]
## broom 0.5.6 2020-04-20 [1]
## callr 3.4.3 2020-03-28 [1]
## cellranger 1.1.0 2016-07-27 [1]
## class 7.3-17 2020-04-26 [2]
## cli 2.0.2 2020-02-28 [1]
## coda 0.19-3 2019-07-05 [1]
## codetools 0.2-16 2018-12-24 [2]
## colorspace 1.4-1 2019-03-18 [1]
## conflicted * 1.0.4 2019-06-21 [1]
## cowplot * 1.0.0 2019-07-11 [1]
## crayon 1.3.4 2017-09-16 [1]
## curl 4.3 2019-12-02 [1]
## DBI 1.1.0 2019-12-15 [1]
## dbplyr 1.4.4 2020-05-27 [1]
## digest 0.6.25 2020-02-23 [1]
## directlabels * 2020.1.31 2020-02-01 [1]
## dplyr * 1.0.0 2020-05-29 [1]
## ellipsis 0.3.1 2020-05-15 [1]
## evaluate 0.14 2019-05-28 [1]
## fansi 0.4.1 2020-01-08 [1]
## farver 2.0.3 2020-01-16 [1]
## forcats * 0.5.0 2020-03-01 [1]
## fs 1.4.1 2020-04-04 [1]
## generics 0.0.2 2018-11-29 [1]
## ggdist 2.1.1 2020-06-14 [1]
## ggplot2 * 3.3.2 2020-06-19 [1]
## ggridges 0.5.2 2020-01-12 [1]
## glue 1.4.1 2020-05-13 [1]
## gower 0.2.2 2020-06-23 [1]
## gridExtra 2.3 2017-09-09 [1]
## gtable 0.3.0 2019-03-25 [1]
## haven 2.3.1 2020-06-01 [1]
## hms 0.5.3 2020-01-08 [1]
## htmltools 0.5.0 2020-06-16 [1]
## httr 1.4.1 2019-08-05 [1]
## inline 0.3.15 2018-05-18 [1]
## ipred 0.9-9 2019-04-28 [1]
## jsonlite 1.6.1 2020-02-02 [1]
## knitr 1.29 2020-06-23 [1]
## labeling 0.3 2014-08-23 [1]
## lattice 0.20-41 2020-04-02 [2]
## lava 1.6.7 2020-03-05 [1]
## lazyeval 0.2.2 2019-03-15 [1]
## lifecycle 0.2.0 2020-03-06 [1]
## loo 2.2.0 2019-12-19 [1]
## lubridate * 1.7.9 2020-06-08 [1]
## magrittr * 1.5 2014-11-22 [1]
## MASS 7.3-51.6 2020-04-26 [2]
## Matrix 1.2-18 2019-11-27 [2]
## matrixStats 0.56.0 2020-03-13 [1]
## memoise 1.1.0 2017-04-21 [1]
## modelr 0.1.8 2020-05-19 [1]
## munsell 0.5.0 2018-06-12 [1]
## nlme 3.1-148 2020-05-24 [2]
## nnet 7.3-14 2020-04-26 [2]
## packrat 0.5.0 2018-11-14 [1]
## PerformanceAnalytics * 2.0.4 2020-02-06 [1]
## pillar 1.4.4 2020-05-05 [1]
## pkgbuild 1.0.8 2020-05-07 [1]
## pkgconfig 2.0.3 2019-09-22 [1]
## plyr 1.8.6 2020-03-03 [1]
## prettyunits 1.1.1 2020-01-24 [1]
## processx 3.4.2 2020-02-09 [1]
## prodlim 2019.11.13 2019-11-17 [1]
## ps 1.3.3 2020-05-08 [1]
## purrr * 0.3.4 2020-04-17 [1]
## quadprog 1.5-8 2019-11-20 [1]
## Quandl 2.10.0 2019-06-12 [1]
## quantmod * 0.4.17 2020-03-31 [1]
## R6 2.4.1 2019-11-12 [1]
## Rcpp 1.0.4.6 2020-04-09 [1]
## RcppParallel 5.0.2 2020-06-24 [1]
## readr * 1.3.1 2018-12-21 [1]
## readxl 1.3.1 2019-03-13 [1]
## recipes 0.1.13 2020-06-23 [1]
## reprex 0.3.0 2019-05-16 [1]
## revealjs 0.9 2017-03-13 [1]
## rlang * 0.4.6 2020-05-02 [1]
## rmarkdown 2.3 2020-06-18 [1]
## rpart 4.1-15 2019-04-12 [2]
## rsconnect 0.8.16 2019-12-13 [1]
## rstan * 2.19.3 2020-02-11 [1]
## rstudioapi 0.11 2020-02-07 [1]
## rvest 0.3.5 2019-11-08 [1]
## scales * 1.1.1 2020-05-11 [1]
## sessioninfo 1.1.1 2018-11-05 [1]
## snakecase * 0.11.0 2019-05-25 [1]
## StanHeaders * 2.21.0-5 2020-06-09 [1]
## stringi 1.4.6 2020-02-17 [1]
## stringr * 1.4.0 2019-02-10 [1]
## survival 3.1-12 2020-04-10 [2]
## svUnit 1.0.3 2020-04-20 [1]
## tibble * 3.0.1 2020-04-20 [1]
## tidybayes * 2.1.1 2020-06-19 [1]
## tidyquant * 1.0.1.9000 2020-09-13 [1]
## tidyr * 1.1.0 2020-05-20 [1]
## tidyselect 1.1.0 2020-05-11 [1]
## tidyverse * 1.3.0 2019-11-21 [1]
## timeDate 3043.102 2018-02-21 [1]
## timetk 2.0.0 2020-05-31 [1]
## TTR * 0.23-6 2019-12-15 [1]
## utf8 1.1.4 2018-05-24 [1]
## vctrs * 0.3.1 2020-06-05 [1]
## withr 2.2.0 2020-04-20 [1]
## xfun 0.15 2020-06-21 [1]
## xml2 1.3.2 2020-04-23 [1]
## xts * 0.12-0 2020-01-19 [1]
## yaml 2.2.1 2020-02-01 [1]
## zoo * 1.8-8 2020-05-02 [1]
## source
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## CRAN (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## CRAN (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.2)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.2)
## RSPM (R 4.0.0)
## CRAN (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## CRAN (R 4.0.1)
## CRAN (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## CRAN (R 4.0.1)
## CRAN (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.2)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.2)
## RSPM (R 4.0.0)
## RSPM (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.1)
## CRAN (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.2)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.2)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## CRAN (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.1)
## Github (business-science/tidyquant@017b7d9)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.1)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
## RSPM (R 4.0.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library